fix(deploy): raise heap + memory limit for full published import#131

Merged
themightychris merged 2 commits into
mainfrom
fix/boot-oom-heap-bump
Jun 25, 2026
Conversation

@themightychris themightychris commented Jun 25, 2026

Why — incident: sandbox boot OOM (and the over-correction that followed)

Deploying the home-CTA fix (#128) included a rollout restart. The Recreate strategy terminated the only replica, and the new pod crash-looped on boot with:

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

Root cause (pre-existing, not the CTA change): a cold boot rebuilds in-memory state from the full published import — ~31.8k people, ~10.4k tag-assignments, 1k tags, 268 projects, plus secondary indices. That no longer fits in the prior 1.5 GB V8 old-space. The long-running pod had been serving state from an earlier, lighter boot; the restart forced the first cold load of the current import. (FTS5 is better-sqlite3 / off-heap, so the V8 heap holds the record maps + indices.)

Over-correction (resolved): the first fix here raised heap→3072 / limit→3.5Gi. On these ~3.9Gi nodes that let the pod grow until it starved a node's kubelet → NodeNotReady, cascading into an RWO volume multi-attach deadlock. Corrected to node-safe values, which have run stable for 8h.

What

  • configmap.yaml: NODE_OPTIONS --max-old-space-size 1536 → 2048
  • deployment.yaml: container memory limit 2Gi → 2560Mi, requests 768Mi → 1Gi

2048 boots the full import cleanly; capping the container at 2.5Gi leaves ~1.4Gi node headroom so a single pod can't take a node down again.
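The headroom figures above can be checked with a little arithmetic. The sketch below is illustrative only: it treats the quoted "~3.9Gi" as the node's total memory (ignoring any system reservations, which is an assumption), and the helper names `to_mib` and `headroom` are hypothetical, not part of this repo.

```python
MIB_PER_GIB = 1024


def to_mib(quantity: str) -> float:
    """Convert a Kubernetes-style memory quantity ('2560Mi', '3.5Gi') to MiB."""
    if quantity.endswith("Gi"):
        return float(quantity[:-2]) * MIB_PER_GIB
    if quantity.endswith("Mi"):
        return float(quantity[:-2])
    raise ValueError(f"unsupported quantity: {quantity!r}")


def headroom(node: str, limit: str, heap_mib: int) -> dict:
    """Memory left on the node outside the container limit, and memory
    left inside the container above the V8 old-space ceiling."""
    node_mib, limit_mib = to_mib(node), to_mib(limit)
    return {
        "node_headroom_gib": round((node_mib - limit_mib) / MIB_PER_GIB, 2),
        "non_heap_mib": round(limit_mib - heap_mib),
    }


NODE = "3.9Gi"  # approximate node size quoted in this PR (assumed to be total memory)

# (label, container memory limit, --max-old-space-size in MiB)
for label, limit, heap in [
    ("over-correction", "3.5Gi", 3072),
    ("merged values", "2560Mi", 2048),
]:
    print(label, headroom(NODE, limit, heap))
```

Under these assumptions the over-corrected limit left only about 0.4Gi for everything else on the node, while the merged values leave about 1.4Gi. In both cases 512Mi remains inside the container above the old-space ceiling for the rest of the process (new space, code, native off-heap allocations such as better-sqlite3).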

Status

  • Live deployment is already on these values (applied during the incident) — 8h stable, 0 restarts. This PR makes the repo/GitOps match what's running.
  • Also resolved a latent kustomize drift along the way: the live Deployment's selector predated managed-by: kustomize being added to the selector, so apply failed with a spec.selector: immutable error. The Deployment was deleted + recreated, so it now matches the rendered selector and future apply -k is clean.
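A minimal sketch of how this kind of selector drift could be spotted before an apply, assuming the live and rendered `matchLabels` have already been loaded into plain dicts. The `selector_drift` helper and the `app: example` label are hypothetical; the `managed-by` key is written as it appears in this PR and may differ in the actual manifests.

```python
def selector_drift(live: dict, rendered: dict) -> dict:
    """Return {label: (live_value, rendered_value)} for every label that
    differs between two matchLabels maps. Missing labels show up as None."""
    keys = set(live) | set(rendered)
    return {
        key: (live.get(key), rendered.get(key))
        for key in sorted(keys)
        if live.get(key) != rendered.get(key)
    }


# Hypothetical selectors illustrating the situation described above.
live = {"app": "example"}
rendered = {"app": "example", "managed-by": "kustomize"}

drift = selector_drift(live, rendered)
if drift:
    # A Deployment's spec.selector is immutable, so any difference here
    # means an apply would be rejected until the object is recreated.
    print("selector drift:", drift)
```

An empty result means the rendered selector matches the live one and an apply would not touch the immutable field.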

Follow-ups

🤖 Generated with Claude Code

A cold boot rebuilding in-memory state from the full `published` import
(~31.8k people, ~10.4k tag-assignments, plus secondary indices) OOM'd at
the previous 1536Mi V8 old-space ceiling: "FATAL ERROR: Reached heap limit
Allocation failed - JavaScript heap out of memory". The long-running pod had
been serving state from an earlier, lighter boot and never had to rebuild;
a rollout restart forced the first cold load of the current import and it no
longer fit.

The native FTS5 store is off-heap (better-sqlite3), so the V8 heap holds the
record maps + indices. Raise NODE_OPTIONS --max-old-space-size 1536 -> 3072
and the container memory limit 2Gi -> 3.5Gi (nodes are 3.9Gi), with requests
768Mi -> 1Gi. The ~60x on-disk-to-heap expansion is suspiciously large and
is tracked separately for a memory-optimization investigation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Correcting this branch's first attempt. The initial 3072 heap / 3.5Gi limit
restored the boot but was too large for the ~3.9Gi nodes: as the pod grew it
starved the node's kubelet and drove it NodeNotReady, which cascaded into an
RWO volume multi-attach deadlock and a longer outage.

The proven-safe values (live for 8h, 0 restarts): heap 2048 / container limit
2560Mi / request 1Gi. 2048 boots the full `published` import cleanly; capping
the container at 2.5Gi leaves ~1.4Gi node headroom so a single pod can't take
a node down again. Reducing the footprint further is tracked in the
memory-optimization issue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@themightychris themightychris merged commit 19fb503 into main Jun 25, 2026
1 check passed
@themightychris themightychris deleted the fix/boot-oom-heap-bump branch June 25, 2026 23:18